In this study, the author proposes a k-nearest neighbor (KNN) based heart disease prediction
model. The author conducted an experiment to evaluate the performance of the proposed model and
analyzed its predictive performance. To conduct the study, the author obtained heart disease data
from the Kaggle machine learning data repository. The dataset consists of 1025 observations, of
which 499 (48.68%) are heart disease negative and 526 (51.32%) are heart disease positive.
Finally, the performance of the KNN algorithm is analyzed on the test set. The analysis of the
experimental results on the Kaggle heart disease dataset shows that the accuracy of the KNN is
91.99%.

Keywords: Automated diagnosis, Disease prediction

1. INTRODUCTION 

Heart disease is a condition in which a waxy substance called plaque builds up in the coronary arteries. This
accumulation of plaque slows the blood pumping process and eventually causes death if not treated [1]. Heart
disease is one of the causes of disease and mortality among the population of the world. Prediction of
cardiovascular disease is regarded as one of the most important research areas in clinical data analysis.
Nowadays, the amount of data in healthcare centers is large. Machine learning algorithms are widely used in
object recognition and disease diagnosis [2]. In disease diagnosis, a machine learning algorithm turns a large
collection of healthcare data into information that can help to make better decisions and predictions.
Predicting disease and developing machine-based diagnostic systems is one of the goals of machine learning
research that has gained importance in the medical research field. Such systems support health experts,
thereby improving the precision and accuracy of the decision-making process during the identification and
diagnosis of a disease [3]-[15].

One of the major problems in heart disease diagnosis is error during the diagnosis process. These errors
occur due to a lack of experienced specialists in the medical field who can accurately and precisely identify
heart disease. The literature survey [1]-[25] shows that heart disease is still a serious issue that needs
further research in order to address the mortality rate caused by the disease. In this research, we propose a
heart disease prediction model that employs the k-nearest neighbor (KNN) algorithm. The research aims to
answer the following questions: i) What is the right distance measure that produces the optimal accuracy for
KNN on heart disease prediction? ii) What is the performance of the KNN algorithm on prediction of heart
disease? iii) What is the effect of the number of neighbors on the predictive accuracy of KNN on heart disease
prediction? 
2. RELATED WORK 

Numerous research works have focused on heart disease identification by employing machine learning
algorithms. These works applied different machine learning algorithms to develop prediction models for
classification of heart disease. Some of the previous research works on heart disease prediction are
discussed in this section. Gavhane et al. [4] applied Naive Bayes, decision tree and random forest
algorithms to the Cleveland heart disease dataset. The predictive performance of the algorithms was
evaluated on the test dataset, and the random forest algorithm outperformed the decision tree and Naive
Bayes algorithms. 

Hasan et al. [5] applied the Gaussian Naive Bayes algorithm to the online University of California,
Irvine (UCI) heart disease data repository. The algorithm was evaluated on its predictive accuracy, and the
experimental analysis shows that the highest accuracy achieved by Gaussian Naive Bayes on prediction of
heart disease is 84.05%. 

Ambekar and Phalnikar [6] conducted a comparative analysis of the predictive performance of machine
learning algorithms, such as Gaussian Naive Bayes, logistic regression, random forest and KNN, on a heart
disease dataset. The comparison shows that logistic regression outperformed the other algorithms, with
better prediction accuracy. 

Pawlovsky [7] proposed a heart disease prediction model employing a convolutional neural network
(CNN). The accuracy of the proposed model was evaluated on a test dataset, and the analysis of the result
shows that the CNN algorithm achieved a prediction accuracy of 65%. Zunaidi et al. [8] applied KNN to
heart disease observations collected from Wisconsin. The authors compared the performance of linear and
non-linear support vector machines. The result of the performance analysis shows that KNN has a predictive
accuracy of 84.8% on the heart disease classification problem. 

Jothi et al. [9] conducted a comparative study of machine learning algorithms, namely decision tree,
random forest and multi-layer perceptron, on the Wisconsin heart disease data repository. The algorithms
were evaluated on their accuracy on heart disease prediction, and the result shows that the multi-layer
perceptron neural network is better at predicting heart disease. Jabbar et al. [10] applied a support vector
machine to the heart disease data repository to develop a heart disease prediction model. The authors
applied feature selection to improve the prediction performance of the proposed model, and the result shows
that the model has an accuracy of 56.16%. 

Assegie et al. [11] applied Naive Bayes to the Wisconsin heart disease data repository to predict
heart disease. The maximum prediction accuracy achieved by this model is 87%. A prediction accuracy of 87%
is regarded as acceptable for machine learning prediction systems; hence, the Naive Bayes model performs
well on prediction of heart disease. 


3. RESEARCH METHOD 

In this research, the researcher collected heart disease data from the Kaggle data repository for training
and testing the proposed KNN model. The implementation and experimental testing were carried out in the
Python 3.7 programming language. Pearson’s correlation analysis, a statistical method, together with data
visualization and feature relationship measures, is employed to explore and interpret the heart disease
dataset and to find the relationship between the class and the features in the observations. To develop the
heart disease prediction model, the researcher employed the KNN algorithm. Figure 1 shows the heart disease
class distribution in the dataset. 


[Figure: heart disease distribution in the dataset; classes: no heart disease, heart disease]

Figure 1. Heart disease patient and non-patient class distribution 
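
The paper does not list its implementation, so the following is only a minimal sketch of how a KNN classifier with a selectable distance measure could look in plain Python. The function names (`euclidean`, `manhattan`, `knn_predict`) and the toy feature vectors are illustrative assumptions, not the author's code and not the Kaggle data.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line (L2) distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort the training points by distance to the query and keep the k closest.
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    # most_common(1) returns the label with the highest vote count.
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features per observation, 1 = patient, 0 = not patient.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = [0, 0, 0, 1, 1, 1]

pred_euclid = knn_predict(train_X, train_y, (1.1, 1.0), k=3, distance=euclidean)
pred_manhat = knn_predict(train_X, train_y, (3.9, 4.1), k=3, distance=manhattan)
```

Passing the distance function as a parameter makes it straightforward to compare distance measures on the same split, which is what the first research question asks. Because both measures depend on feature ranges, the features would normally be brought to a comparable scale first.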
3.1. Dataset description 

The Kaggle heart disease dataset used in this study consists of 1025 observations and 13 features.
Among the 1025 observations, 499 (48.68%) are heart disease negative and 526 (51.32%) are heart disease
positive. The dataset has no missing feature values. Table 1 summarizes the details of the features of the
heart disease dataset. Of the observations, 80% are used for training and the remaining 20% are used for
testing. 
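
The class shares and the 80/20 split sizes quoted above can be checked with a few lines of Python. This is a sketch under stated assumptions: the shuffling strategy, the seed and the helper names (`class_share`, `train_test_indices`) are hypothetical, since the paper does not say how the split was drawn.

```python
import random

N_TOTAL = 1025      # observations reported in the paper
N_NEGATIVE = 499    # heart disease negative
N_POSITIVE = 526    # heart disease positive


def class_share(count, total):
    """Return the share of a class as a percentage rounded to two decimals."""
    return round(100.0 * count / total, 2)


def train_test_indices(n, test_fraction=0.2, seed=0):
    """Shuffle row indices 0..n-1 and split them into train and test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # the seed is an arbitrary assumption
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


negative_share = class_share(N_NEGATIVE, N_TOTAL)   # 48.68
positive_share = class_share(N_POSITIVE, N_TOTAL)   # 51.32
train_idx, test_idx = train_test_indices(N_TOTAL)   # 820 training rows, 205 test rows
```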

Table 1 lists the heart disease dataset features employed for training and testing the KNN model.
Figure 2 shows the distribution of the heart disease patient and non-patient classes for the different slope
conditions: up sloping, down sloping and flat. As shown in Figure 2, the number of patients is higher when
the patient's ST-T wave is up sloping. 


Table 1. The Kaggle heart disease data repository features description 


No.  Feature   Description 
1    Age       The age of a person 
2    CP        Chest pain type (1= typical angina, 2= atypical angina, 3= non-anginal pain, 4= asymptomatic) 
3    trestbps  Resting blood pressure 
4    Chol      Serum cholesterol in mg/dl 
5    fbs       Fasting blood sugar 
6    restecg   Resting electrocardiographic results (values 0, 1, 2) 
7    thalach   Maximum heart rate achieved 
8    exang     Exercise-induced angina (1= yes, 0= no) 
9    Oldpeak   ST depression induced by exercise relative to rest 
10   slope     The slope of the peak exercise ST segment 
11   ca        Number of major vessels (0-3) colored by fluoroscopy 
12   Thal      Thalassemia (3= normal, 6= fixed defect, 7= reversible defect) 
13   target    Class label (1= patient, 0= not patient) 

[Plot: Heart Disease Frequency According to Slope]
Figure 2. Slope vs heart disease positive and negative class 


3.2. Feature correlation model 

The author employed Pearson’s correlation analysis to visualize the relationship between each pair of
features. This helps to identify the features that are strongly related to the class feature in the dataset.
The Pearson’s correlation matrix for the features of the heart disease dataset is shown in Figure 3. As
illustrated in Figure 3, some of the features are positively correlated. For instance, age and resting blood
pressure (trestbps) have a correlation coefficient of 0.27. Similarly, cholesterol is correlated with age,
with a correlation coefficient of 0.22. In addition, the number of major vessels is correlated with age, with
a correlation coefficient of 0.25. Slope and maximum heart rate achieved have a higher correlation
coefficient of 0.4. In contrast, features such as resting electrocardiographic results and exercise-induced
angina are negatively correlated with age. 
Figure 3. Heart disease feature correlation matrix 
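
For readers who want to see how such a matrix is obtained, the sketch below computes the Pearson coefficient as the covariance of two features divided by the product of their standard deviations, then fills a pairwise matrix. The helper names (`pearson`, `correlation_matrix`) and the toy values are assumptions for illustration; they are not the Kaggle data and will not reproduce the coefficients quoted above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # The 1/n factors of covariance and variances cancel, so raw sums are enough.
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson coefficients for a dict mapping feature name to a list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy values, not taken from the Kaggle data.
toy = {
    "age": [40, 50, 60, 70],
    "trestbps": [120, 125, 135, 140],
    "thalach": [180, 170, 150, 140],
}
matrix = correlation_matrix(toy)
```

A coefficient near +1 or -1 indicates a strong linear relationship, while values near 0 indicate a weak one; the sign gives the direction.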


4. RESULT AND DISCUSSION 

In this section, the experimental test results of the proposed model are explained. The predictive
performance of the proposed KNN model is analyzed using performance metrics such as accuracy and the
confusion matrix, along with the learning curve of the model. Table 2 shows the accuracy of the proposed
KNN model on five random tests. 

As shown in Table 2, the highest accuracy score over the five random tests is 92.68%, with an average
accuracy of x%. The predictive performance of the proposed model is also examined on the training set. The
predictive accuracy of the proposed model is shown in Figure 4. 


Table 2. Accuracy of KNN 
Figure 4. Heart disease feature correlation matrix 
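
As a worked check of the accuracy arithmetic, the sketch below assumes that the test set holds 20% of the 1025 observations, that is 205 rows. Under that assumption, the reported best score of 92.68% is consistent with 190 correct predictions. The per-run counts in `example_counts` are hypothetical placeholders that only show how a best and a mean score over five random tests would be computed; they are not the paper's results.

```python
def accuracy(n_correct, n_total):
    """Fraction of correctly classified observations."""
    return n_correct / n_total


def summarize_runs(accuracies):
    """Return (best, mean) accuracy over several random tests."""
    return max(accuracies), sum(accuracies) / len(accuracies)


# 20% of 1025 observations gives 205 test observations (assumed test set size).
n_test = round(0.2 * 1025)

# 190 correct out of 205 rounds to the reported best score of 92.68%.
best_reported = round(100 * accuracy(190, n_test), 2)

# Hypothetical correct-prediction counts for five runs; NOT the paper's results.
example_counts = [190, 188, 187, 189, 186]
best, mean = summarize_runs([accuracy(c, n_test) for c in example_counts])
```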


4.1. Confusion matrix 

A confusion matrix measures the predictive performance of the proposed model in terms of the number of
correct and incorrect predictions on the test set. The confusion matrix of the KNN model is shown in
Figure 5. 
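
To make the definition concrete, here is a minimal sketch that tallies a binary confusion matrix and derives accuracy from it. The function names and the eight example labels are hypothetical and are not taken from the experiment.

```python
def confusion_matrix(y_true, y_pred):
    """Return counts (tn, fp, fn, tp) for binary labels, where 1 = patient and 0 = not patient."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1   # patient correctly identified
        elif truth == 0 and pred == 0:
            tn += 1   # non-patient correctly identified
        elif truth == 0 and pred == 1:
            fp += 1   # non-patient flagged as patient
        else:
            fn += 1   # patient missed by the model
    return tn, fp, fn, tp


def accuracy_from_counts(tn, fp, fn, tp):
    """Accuracy is the share of predictions that fall on the matrix diagonal."""
    return (tn + tp) / (tn + fp + fn + tp)


# Hypothetical labels for eight test observations.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
acc = accuracy_from_counts(tn, fp, fn, tp)
```

In a diagnostic setting the off-diagonal cells matter separately: a false negative is a missed patient, while a false positive is a healthy person flagged for follow-up.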
4.2. Training and test accuracy vs k-values 

The learning curves of the proposed model show the performance of the model on the training and test
sets for different k-values, as shown in Figure 6. Figure 6 plots the training and test set accuracy on the
y-axis against the number of neighbors (k) on the x-axis. The worst performance of the model is
approximately 72.25%, which is still acceptable. The model’s best performance is at one neighbor (K = 1),
and accuracy drops for higher numbers of neighbors. 
Figure 5. Confusion matrix of KNN model 
Figure 6. K value vs accuracy for KNN model 
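
A curve of accuracy against k can be produced by refitting the classifier for each k and scoring it on both the training and the test set. The sketch below does this on a hypothetical toy dataset with a plain-Python KNN; the helper names and data are assumptions, not the author's setup. It also shows a general property worth keeping in mind when reading such a curve: at k = 1 every training point is its own nearest neighbor, so training accuracy is 100% whenever no duplicate points carry conflicting labels.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy_for_k(train_X, train_y, eval_X, eval_y, k):
    """Accuracy of a k-neighbor model built on the training data and scored on eval data."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(eval_X, eval_y))
    return hits / len(eval_y)


# Hypothetical toy data (1 = patient, 0 = not patient).
train_X = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (3.0, 3.2), (3.5, 3.0), (2.4, 2.2)]
train_y = [0, 0, 0, 1, 1, 0]
test_X = [(1.2, 1.1), (3.2, 3.1), (2.6, 2.6)]
test_y = [0, 1, 1]

# Map each odd k to a (training accuracy, test accuracy) pair.
curve = {
    k: (
        accuracy_for_k(train_X, train_y, train_X, train_y, k),
        accuracy_for_k(train_X, train_y, test_X, test_y, k),
    )
    for k in (1, 3, 5)
}
```

Odd values of k are used here so that a two-class vote cannot tie. Because the k = 1 training score is perfect by construction, the test accuracy is the more informative of the two curves when choosing k.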


5. CONCLUSION 

In this research, the author proposed a KNN-based model for heart disease prediction using a dataset
obtained from the Kaggle machine learning data repository. The proposed model solves the problem of biased
classification on imbalanced observations by a non-ensemble algorithm through an ensemble classifier,
namely adaptive boosting. The predictive performance of the proposed model is evaluated on the test set
using performance metrics such as accuracy and the confusion matrix. The result of the performance analysis
shows that the adaptive boosting algorithm performs better than the decision tree. Hence, the adaptive
boosting algorithm is a better classifier for imbalanced observations, where the use of a non-ensemble
algorithm such as a decision tree results in prediction biased towards the majority class, yielding better
performance on the majority class and poor performance on the minority class. 